December 17, 2019

Ten Tips for Data Analysis

Anurag Sharma (anu0012)

DURATION

10 min

1. Check for null values

Once we read the dataset, we should always check for null values. The following line of code can be used for this:

df.isnull().sum(axis=0)

It gives the total number of null values for each column in the dataset.
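As a minimal sketch of what this check returns, the example below builds a small made-up DataFrame (the column names and values are assumptions for illustration only) and also computes the share of nulls per column, which can be easier to read on large datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "score": [88, 92, 79, 95],
})

null_counts = df.isnull().sum(axis=0)   # number of nulls in each column
null_share = df.isnull().mean() * 100   # percentage of nulls in each column

print(null_counts)
print(null_share)
```

Columns with a high null percentage are usually the first ones to look at more closely.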

2. Check for duplicate values

We can check whether there are duplicate values in the dataset using the following command:

df.duplicated().sum()

It gives us the total number of duplicated rows. We can drop the duplicates, keeping the first occurrence of each row, using the following command:

df = df.drop_duplicates()
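The keep argument of drop_duplicates decides which copies survive, and it is easy to get wrong. The sketch below, on a made-up DataFrame (the names are assumptions), compares the default keep="first" with keep=False, which removes every row that has a duplicate, including its first occurrence.

```python
import pandas as pd

# Hypothetical toy data with two duplicated rows.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "value": ["a", "a", "b", "c", "c"],
})

n_dupes = df.duplicated().sum()              # rows that repeat an earlier row

first_kept = df.drop_duplicates()            # default keep="first": one copy of each row stays
none_kept = df.drop_duplicates(keep=False)   # every row that has a duplicate is removed

print(n_dupes, len(first_kept), len(none_kept))
```

Note that drop_duplicates returns a new DataFrame, so the result has to be assigned back to a variable to take effect.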

3. Check for unique values

This is useful for categorical data. The distinct values in a particular column can be listed using the following code (use nunique() instead to get only their count):

df['column_name'].unique()
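To make the difference between listing and counting concrete, here is a small sketch on made-up data (the values are assumptions). unique() returns the distinct values, including a null if one is present, while nunique() returns how many distinct non-null values there are.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
df = pd.DataFrame({"column_name": ["red", "blue", "red", None, "green"]})

values = df["column_name"].unique()    # distinct values; the null shows up as its own entry
count = df["column_name"].nunique()    # number of distinct values; nulls are excluded by default

print(values)
print(count)
```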

4. Replace/drop null values

Dealing with null values is also very important for data analysis. We can either drop the column if it has too many null values

df.drop(columns=['column_name'], inplace=True)

or we can impute them with the mean/median/mode.

df['column'] = df['column'].fillna(df['column'].mean())
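The same pattern works for the median and the mode. The sketch below uses made-up column names ("income" and "segment" are assumptions): the median is often a safer choice than the mean when a numeric column has outliers, and the mode is the usual choice for a categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with an outlier, one categorical column.
df = pd.DataFrame({
    "income": [30.0, 40.0, np.nan, 500.0],
    "segment": ["a", "b", "a", None],
})

# Numeric column: fill nulls with the median, which the outlier barely affects.
df["income"] = df["income"].fillna(df["income"].median())

# Categorical column: fill nulls with the most frequent value.
# mode() returns a Series because there can be ties; [0] takes the first one.
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

print(df)
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, avoids silently modifying a temporary copy.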

5. Correlation matrix

The correlation matrix shows how strongly each pair of numeric features is correlated. The following code can be used to plot it:

import matplotlib.pyplot as plt
import seaborn as sns
corrmat = df.corr(numeric_only=True)  # numeric columns only (pandas >= 1.5)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square=True, ax=ax)

We are using the seaborn package to plot the matrix.
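With many features the plot can get crowded, so it sometimes helps to read the numbers directly. The sketch below is one possible way to do that on made-up data (all column names are assumptions): it computes the matrix on numeric columns only and then sorts the correlations against a single column.

```python
import pandas as pd

# Hypothetical toy data: y rises with x, z falls with x, label is non-numeric.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "c", "d"],
})

# numeric_only=True skips the text column (available in pandas >= 1.5).
corrmat = df.corr(numeric_only=True)

# Correlation of every other numeric column with "x", strongest positive first.
against_x = corrmat["x"].drop("x").sort_values(ascending=False)

print(corrmat)
print(against_x)
```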

6. Check distribution of a variable

We can also check the distribution of a particular column. A plot shows whether the data follows a normal distribution or some other one. The following command can be used:

import seaborn as sns
sns.histplot(df['column'], kde=True)  # distplot is deprecated in recent seaborn releases
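A quick numeric companion to the plot is the skewness of the column. This is only a rough sketch on made-up values (the column names are assumptions): a skewness near zero suggests a roughly symmetric distribution, while a clearly positive value points to a long right tail.

```python
import pandas as pd

# Hypothetical toy data: one symmetric column, one with a long right tail.
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 2, 10],
})

sym_skew = df["symmetric"].skew()        # close to 0 for symmetric data
right_skew = df["right_tailed"].skew()   # positive when the right tail is long

print(sym_skew, right_skew)
```

Skewness alone does not prove normality, so it is best read alongside the plot rather than instead of it.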

7. Check datatypes of columns

We can check the datatype of each column in the data. This helps us confirm whether each column has the correct data type.

df.dtypes
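A common thing this check reveals is a numeric column that was read as text because of a stray placeholder value. The sketch below uses made-up data (the column names and the "n/a" placeholder are assumptions) and shows one way to fix it with pd.to_numeric.

```python
import pandas as pd

# Hypothetical toy data: "price" holds numbers as strings plus a placeholder.
df = pd.DataFrame({"price": ["10", "20", "n/a"], "qty": [1, 2, 3]})

print(df.dtypes)  # "price" is reported as a text dtype, not a numeric one

# Convert to numbers; values that cannot be parsed become NaN instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
```

Because errors="coerce" turns unparseable values into nulls, it is worth re-running the null check afterwards.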

8. Deal with datetime columns

Dates often arrive as plain strings rather than in datetime format. We can convert them to the correct format using the following code (assuming pandas is imported as pd):

df['Date.of.Birth'] = pd.to_datetime(df['Date.of.Birth'])
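Real date columns often contain a few malformed entries. As a sketch on made-up values (the date format and the bad entry are assumptions), pd.to_datetime with errors="coerce" turns unparseable strings into NaT instead of failing, and the .dt accessor then gives access to parts of the date.

```python
import pandas as pd

# Hypothetical toy data with one entry that is not a valid date.
df = pd.DataFrame({"Date.of.Birth": ["1990-05-17", "1985-12-01", "not a date"]})

# An explicit format avoids ambiguous parsing; bad values become NaT.
df["Date.of.Birth"] = pd.to_datetime(
    df["Date.of.Birth"], format="%Y-%m-%d", errors="coerce"
)

# Once the column is datetime, date parts are available through .dt
birth_year = df["Date.of.Birth"].dt.year

print(df.dtypes)
print(birth_year)
```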

9. Value count

We can check the frequency of different categories in a categorical column:

df['column'].value_counts()
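Two optional arguments are often handy here. The sketch below, on made-up data, uses normalize=True to get relative frequencies and dropna=False to include missing values in the count, since value_counts() leaves them out by default.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
df = pd.DataFrame({"column": ["a", "b", "a", None, "a"]})

counts = df["column"].value_counts()                   # nulls are excluded by default
shares = df["column"].value_counts(normalize=True)     # fractions of the non-null rows
with_nulls = df["column"].value_counts(dropna=False)   # missing values counted as a category

print(counts)
print(shares)
print(with_nulls)
```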

10. Max/min value of a column

We can find the maximum and minimum values in a particular column using the max() and min() methods:

df['column'].max()
df['column'].min()
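Both values can also be computed in one call, and it is often useful to know which row holds the extreme. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical numeric column.
df = pd.DataFrame({"column": [7, 3, 9, 4]})

col_min = df["column"].min()
col_max = df["column"].max()

# agg returns both statistics at once as a small Series.
summary = df["column"].agg(["min", "max"])

# idxmax / idxmin give the index label of the row holding the extreme value.
row_of_max = df["column"].idxmax()
row_of_min = df["column"].idxmin()

print(summary)
print(row_of_max, row_of_min)
```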

You can check out my sample notebook, which shows all these commands in action on a dataset.

Jupyter Notebook

That’s it for now. Next time we’ll learn some more tips and tricks for data analysis. :)
